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We suggest a simple deterministic approximation for tlie growth of tlie favoured-allele frequency during a se- 
lective sweep. Using this approximation we introduce an accurate model for genetic hitch-hiking. Only when 
Ns < 10 {N is the population size and s denotes the selection coefficient), are discrepancies between our 
approximation and direct numerical simulations of a Moran model noticeable. Our model describes the gene 
genealogies of a contiguous segment of neutral loci close to the selected one, and it does not assume that the 
selective sweep happens instantaneously. This enables us to compute SNP distributions on the neutral segment 
without bias. 



I. INTRODUCTION 



Gene genealogies under neutral evolution are commonly 
described by the s o-called coalescent process ('Hudson", 
[1983, 1990, 2002; Kingman, 1982; Nordborg, 2001), in- 
corporating recombination, geographical and demographical 
structure. An important question is how gene genealogies are 
modified by deviations from neutrality due to positive selec- 
tion. The answer to this question would help understanding to 
what extent and in which way selection has shaped the empir- 
ically observed patterns of genetic variation. 

Many authors have addressed this question by consid- 
ering the effect of a selective sweep at a given locus on 
the gene history at a neighbouring neutral locus. The dy- 
namics of the selective sweep itself has been modelled in 
different ways. Most commonly, a deterministic model 
of the dy namics of the favou re d-alle le frequency has been 
adopted ('Braverman ef a/.', 1995'; KiM and Stephan", 
I2OO2I: Iprzeworski, 2002; Stephan et qLT992.). a n otable 



exception being the early work of Kaplan et al. ( 1989b . Any 
deterministic model is of course an approximation to a more 
appropriate model, such as Moran or Wright-Fisher models 
of directional selection, where the allele frequencies fluctu- 
ate randomly in time. The reasons for attempting to ignore 
these fluctuations are practical ones: the exa ct simulations are 
very time consuming (IKaplan et al.l\l9S^ . and, in addition, 
deterministic models are much more amenable to theoretical 
analysis than the stochastic models. 

Recently. lbuRRETT and SCHWEINSBERgI(|2004 have dis- 
covered an elegant asymptotic model (referred to as the 
'DS-algorithm' in the following) for the genealogy of a 
single neutral locus during a selective sweep occurring in 
its vicinity. As the population size N tends to infin- 
ity, their c oalesc ent process approximates the Moran model 
(IMoran , 1958h with re combination and positive selection. 
IDurrett and Schweinsberg (2004) have ai-gued that the 
fluctuations of the favoured-allele frequency during a se- 
lective sweep may have a significant effect on the gene- 
genealogy of a neighbouring neutral locus, and hence on 
the distribution of single-nucleotide polymorphisms (SNPs) 
at that locus. In a range of parameters determined by 



IDurrett and SchweinsbergI (|2004 . the DS-algorithm 
describes the effect of a selective sweep on the gene genealogy 
of a neutral locus nearby very accurately, in close agreement 
with numerical simulations of a Moran model. 

In this paper we suggest an efficient alternative to the DS- 
algorithm which is equally accurate for the param eters con- 
sidered in (DuR RETT and SchweinsbergI |2004 . as shown 
in Fig. [8] For practical purposes, our algorithm has a num- 
ber of advantages. First, it allows for SNPs to occur during 
the selective sweep because we do not assume that the sweep 
happens instan taneously as does the pa int-box construction 
(fSCHWEiNSBERG and DurrettI l2005b . This avoids a bias 
in the patterns of genetic variation at the neutral loci when the 
number of lines in the sweep is not untypically small. Second, 
in practical applications, the question usually is how selection 
affects genetic variation in a contiguous stretch of neutral loci, 
whereas the DS-algorithm describes the gene genealogy of a 
single locus. Our algorithm, by contrast, determines the an- 
cestral recombination graph of an entire segment of neutral 
loci close to a selected one. For example. Fig. |8] was ob- 
tained by a single run of our algorithm. Third, our new al- 
gorithm gives an accurate description of selective sweeps in a 
much wider parameter range than the a lgorithm proposed by 
IDurrett and SchweinsbergI (|2004 . 

On the theoretical side, we propose an efficient and accurate 
method for averaging over the fluctuations of the favoured- 
allele frequency. Our scheme gives rise to a determinis- 
tic approximation to the time-dependence of the favoured- 
allele frequency during the sweep which, however, is very 
different from the commonly used logistic model. Our 
model is as easily implemented as the logistic model, but 
much more accurate: it gives a very good description of 
the genealogy of contiguous stretch close to a selected lo- 
cus provided Ns > 10 where s parametrises the selec- 
tive advantage of the favoured alle le. By contrast , the DS- 
algorithm ( Schweinsberg and DurrettI l2005l) requires 
r\og{2N)/s ^ 1 in order to be accurate, where r is the re- 
combination rate per individual per generation between the 
selected and the neutral locus. The logistic model requires 
very strong selection and large population size (see Figs. 8- 
10). 
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The remainder of this paper is organised as follows. In sec- 
tion |II] we give a brief account of previous models of selec- 
tive sweeps and their influence on the genealogies of nearby 
loci (usually referred to as 'genetic hitch-hiking', see below). 
In section HiH we describe our implementation of th e Moran 
model. As IDurrett and SchweinsbergI (l2004l) we em- 
ploy Moran-model simulations as a benchmark for our new 
algorithm. This new algorithm rests on two parts: a determin- 
istic model for the favoured-allele frequency during the sweep 
(described in sectionUVji and the coalescent process for a con- 
tiguous segment of neutral loci on the same chromosome as 
the selected locus (section [V]l. In section [Vll we summarise 
our results, and conclude in section IVlIl 



II. SELECTIVE SWEEPS AND GENETIC HITCH-HIKING 

A. Selective sweeps 

Consider the genetic composition at a certain locus in a 
diploid population with a constant generation size N. Sup- 
pose all 2N gene copies were of the same form b when a new 
allele B appeared due to a beneficial mutation. Let the new 
allele B have a fitness advantage (parametrised by s) as com- 
pared to the wild-type allele b. The frequency x{t) of allele 
B at time t is a stochastic process which exhibits a tendency 
to grow, but which may also become fixed at a; = (due 
to genetic drift) corresponding to the extinction of allele B. 
Once x{t) has grown sufficiently from the initial low value 
a::(0) = l/^N, the probability of reaching a; = 1 is high; 
eventually B takes over the population. This process is usu- 
ally referred to as a 'selective sweep'. In the limit of infinite 
population size, a selective sweep is well approximated by the 
deterministic model 



dx 
'dt 



s x{l — x), 



(1) 



see IDurrett and Schw einsbergI (|2004 and the refer- 
ences cited therein. Eq. ([T]i is called the 'logistic-growth equa- 
tion'. 

This growth model is a deterministic approximation to the 
stochastic growth of x{t). The latter is usually modeled 
in terms of the Wright-Fisher model ( IFisherI [1930/19991 
IWRIGHli[l93Tl) with directional selection. This is a haploid 
population model with non-overlapping generations where re- 
production is described by a biased sampling procedure with 
replacement: chromosomes are sampled randomly, with re- 
placement, from the previous generation, so that the ratio of 
the probabilities of choosing a chromosome with the favoured 
allele to that without the favoured allele is 1 : (1 — s). Direct 
numerical simulations of the Wright-Fisher model are com- 
monly employed to determine strengths and weaknesses of 
deterministic approximations such as eq. ([TJ. 

In the following we do not employ the Wright-Fisher model 
as a reference, but a closel y related rnodel with overlapping 
generations introduced b y IMoranI (Il958b . As shown by 
IEtheridge et al\ (l2006h it approximates the Wright-Fisher 
model when the population size is large. 




Time 

FIG. 1 Illustration of the hitch-hiking effect on the ancestral lines of 
a neutral locus. The shaded area corresponds to individuals with the 
advantageous allele B at the selected locus in the population. Close 
to the selected locus, most lines are identical by descent to the orig- 
inator of the sweep (line (iii)). Recombination (shown as dashed 
lines) can cause a line to escape the sweep, i.e. the originator does 
not belong the ancestral line, because at some stage a recombina- 
tion event causes the allele at the neutral locus to be inherited from 
an ancestral line that has not yet been caught the sweep (line (i)). 
Much less likely, but still possible, is for the line to first escape but 
later recombine back into the path of the sweep (line (ii)). After 
IDurrett and SchweinsbergI i2004) . 



B. Genetic hitch-hiking 

Consider the genetic variation at a neutral locus on the same 
chromosome as the selected locus. Clearly, the pattern of ge- 
netic variation at the neutral locus is influenced by a selec- 
tive sweep in its vicinity - the smaller the distance the larger 
the influence. When the B allele first appeared in the popula- 
tion because of a favourable mutation, the corresponding al- 
leles at the neutral locus have more offspring compared with 
other alleles not associated with the B allele on the selected lo- 
cus. Thus, the favoured alleles at the neutral locus are spread 
through the population to a larger extent than can be explained 
in a ne utral model. This effect is kno wn as genetic hitch- 
hiking dMAYNARD Smith and HaigiA il974.) . Far from the 
selected locus, recombination will effectively eliminate link- 
age between the neutral and selected loci, so that the influence 
of the selective sweep becomes negligible. 

Figure [T] illustrates the hitch-hiking effect in terms of the 
ancestral graph for a small hypothetical sample of sequences 
taken at a neutral locus. (For the sake of clarity we assume 
that the selected locus is located left of the neutral locus of in- 
terest.) Most ancestral lines can be traced back to the origina- 
tor of the sweep, but some lines exhibit recombination events 
allowing them to escape from the sub-population with the B 
allele. 

It is straightforward but cumbersome to directly simu- 
late the Wright-Fisher (or Moran) model in order to anal- 
yse how patterns of genetic variation are affected by hitch- 
hiking. Several authors have therefore studied approximations 
to the growth process of the selected allele frequency x{t). 
IKaplan ef^(ll989[) divide the selective sweep into three 
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phases: the early phase is modeled by a supercritical branch- 
ing process, the middle phase is described by the deterministic 
logistic growth, and the final phase is viewed as a sub-critical 
branching process. The probability that the sweep succeeds is 
approximately given by the selective advantage s, when s is 
small. As a consequence, one may need to iterate this proce- 
dure many times to collect enough successful simulations. 

This approach has been simplified b y ignoring the initial 
and fi n al (stochastic) phase s (see, e.g., IBraverman et al. 



culated within the Moran model. 



199?, "kiM and Stepha 
Stephan ef a/., 1992) and instead using the deterministic 
logistic model ^ for the whole sweep. This makes it 
possible to simulate the sweep backwards in time, which 
in turn enables one to perform computations conditional 
on that the sweep succeeds. This approach is significantly 
faster than an a l gorith m based on the better approximation by 
IKaplan ef ^Jl989l). 



lases (see, e. g.. IBraverman ef al . 
A^ l2002t IPrzeworskJ. 12002 : 



BartonI 41998*) (see also I Qtto and BartonI 1 19971) has 



considered a stochastic shift between the introduction of the 
favoured allele and the onset of the deterministic growth; the 
distribution of the shift is derived from modelling the spread 
of the beneficial allele in the initial phase of the sweep as 
a super-critical branching process. This approximation cap- 
tures some of the effects of the conditioning on the success of 
the sweep and the stochastic growth in the early stages of the 
sweep. The middle and late stages of the sweep are treated 
in the logistic approximation. Within his model. Barton gives 
analytical expressions for the probability that two copies of a 
neutral marker are identical by descent, assuming that any re- 
combination eve nt leads to ancestral lines escaping the sw eep. 

As argued by IDurrett and SchweinsbergI (|2004 . the 
disadvantage of ignoring the fluctuations is that the probabil- 
ities of how lines merge and recombine are not correctly de- 
scribed. They consider the gene genealogy of a selected locus 
and a nearby neutral locus and propose an elegant approxima- 
tion to the Moran dynamics, valid in the limit of large popu- 
lation size and strong selection, which captures the stochastic 
aspects of the sweep, and correctly models the partitioning of 
the neutral lines as a consequence of the selective sweep. 



III. THE MORAN MODEL OF POSITIVE SELECTION 



In this section we describe the Moran model (IMoranI 
fl958l) for the evolution of a diploid population of N individ- 
uals. The Moran model is used as a benchmark to test the ac- 
curacy of our coalescent model described in sections HV] and 

m 

We consider a chromosome with a locus subject to positive 
selection and determine both the evolution of this selected lo- 
cus, as well as genealogies of neutral loci in its vicinity. In 
the first subsection we describe the growth of the favoured- 
allele frequency in the population. In the second subsection 
we explain how to condition this process on the success of the 
selective sweep. This is necessary because in trying to de- 
duce the effect of a sweep on neutral loci nearby we assume 
that the sweep actually took place. In the last subsection we 
summarise how gene genealogies of such neutral loci are cal- 



A. Spread of the advantageous allele during the sweep 

As in the previous section we assume that there is a 
favoured allele at the selected locus, B say, and a set of selec- 
tively neutral variants, which we will refer to collectively as b. 
The life-time of each individual is taken to be an independent 
exponentially distributed variable with expected value of one 
generation. When an individual dies, it is replaced with a copy 
of an individual chosen with replacement with uniform prob- 
ability from the whole population, except that replacement of 
an individual with the B allele with an individual with the b 
allele is rejected with probability s; this is what constitutes 
selection in this model. Instead, a parent is chosen with uni- 
form probability from the set of individuals with the B allele. 
Thus, s — Q corresponds to neutral evolution and s = 1 is the 
strongest possible selection. In short, the population evolves 
according to a time-continuous Markov process where the dif- 
ferent events occur with rates 



Wb-b = 2A^ X ( 1 
Wb^B = 2iV X 
WB^b = 2iV X 
wb^b = 2A^ X 
2N X 
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X 1 



2N 



(2) 



The three factors in the rates Wa^is, where a and /3 stand for 
either b or B, have the following interpretations: The first fac- 
tor is the total rate of replacement events in the population per 
generation; the second factor is the probability that the line 
that dies has the allelic type a; the final factor is the probabil- 
ity that the replacing line has the allelic type (3. The second 
term in the rate wb^b corresponds to the rejected B-to-b re- 
placements. It follows from eq. (|2|i that the sum of events is 
2N per generation for all values of s. 

IDurrett and SchweinsbergI (|2004 use a slightly dif- 
ferent version of the Moran model with positive selection. 
In their model, the rejected B-to-b transitions are ignored, 
whereas we take them to be B-to-B transitions. This differ- 
ence does not affect the trajectory of the number of copies of 
the advantageous allelic type. A third possibility would be 
to introduce selection by means of a probability of survival to 
maturity which would be 1 for B alleles, but 1 — s for b alleles. 
The corresponding modifications of eq. ^ would require mi- 
nor changes to the background coalescent described in section 
rvl but we do not discuss these here. 
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B. Conditioning on tlie fixation of allele B 

In each replacement, the number of copies k of allele B 
in the population is either increased by one (corresponding to 
a b — > B event), decreased by one (corresponding to a B ^ b 
event), or left unchanged (corresponding toaB ^Borb ^b 
event). Consider the number ki of copies of the advanta- 
geous allelic type in the population after the ith change in k. 
The sequence ki,k2, . . . then follows a Markov chain, where 
the probability that k is increased by one after a replacement 
where k changes is 



X 10 



Wh^B 



1 



2-s 



(3) 



The probability hk of fixation of the B allele in the population, 
given that there are k copies at present, equals the probability 
of fixation after a change in k. With the probability that k 
increases in (O, one obtains the recursion 



hk 



hk+l + f 1 - TT—] hk-1 



(4) 



where k is between 1 and 2N — 1. If fc is zero, there are 
no copies of B that can reproduce; hence, ho = 0. Simi- 
larly, when k — 2N all individuals in the population has the 
B allele, corresponding to h2N = 1- With these two condi- 
tions the recursio n has a unique solution, given by (see, e.g., 
lDURRETill2002l and references therein) 



hk = 



2N ■ 



I-(I-S) 



(5) 



Usually, the population size is large and the selection param- 
eter is small. If in addition 2Ns is large, we obtain the well- 
known result that the probability hi that the sweep succeeds 
from a single copy of the B allele is approximately s. This 
means that if the sweep is initiated with a single copy of the 
B allele, and the rates are given by Q, in most cases the B 
allele will become extinct in a few generations because of the 
fluctuations in the early stage of the sweep. When k reaches 
a critical level (where ks is relatively large), the probability 
that the fluctuations will cause B to become extinct becomes 
exponentially small; thus, a sweep that escapes this level will 
almost certainly continue to increase in abundance and even- 
tually become fixed in the population. 

In this paper, we consider only sweeps that succeed. It is 
thus necessary to consider the Markov chain conditioned on 
the success of the sweep. The conditioning does not change 
the rate of events replacing an individual for one of the same 
kind, since they d o not affect the success of the sweep. The 
new rates become (IDurrett and S chweinsberg. 2004) : 



w;b^B(fc) = Wb^B{k) 

WB^b{k) = Wb^B{k) 
WB^B(fc) = WB^Bik) , 

Wb^b{k) = Wb^b{k) . 
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, (6) 
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FIG. 2 Growth of the favoured- allele frequency in the population 
(time is measured in generations). The population size is A'^ = 10*, 
and the selection parameter is s = 0.01. Shown are four samples 
of the Moran process (grey lines), the logistic model (dashed red 
line), and our new deterministic model described in section IIV.BI 
solid black line. The new deterministic approximation {2b\ is much 
closer to the Moran curves than the logistic approximation. 



where lo ^ 1 — s. Thus, we can simulate the embedded 
Markov chain of the changes in k, conditioned on the suc- 
cess of the sweep if we take the probability p+{k) of going 
from klok + 1 copies of the B allele as 



P+{k) 



Wb^B 



1 - c^^-+l 



Wb^B+WB— b [l + Uj){l - Uj'') 



(7) 



The probability that the number of alleles decreases from k to 
fc — 1 is p_(fc) = 1 — P+{k). 

Fig. |2] shows four realisations of the favoured-allele fre- 
quency x{t) generated with the algorithm described above. 
Also shown is the logistic model for x{t) (dashed line) which 
is not a good approximation, as well as our new model de- 
scribed in section HV] solid line. 



C. Gene genealogies of the neutral loci during the sweep 

In this section, we describe our implementation of the 
Moran model for simulating the gene genealogies of neutral 
loci in the neighbourhood of a selected locus. The algorithm 
is divided into a forward and a backward phase. 

In the forward phase, we generate the sequence of the num- 
ber fc of B alleles, forward in time, according to the con- 
ditioned Markov process described in the previous section: 
starting from fc = 1, fc is incremented with probability p+(fc), 
or decremented with probability 1 — p+(fc), until fc = 2N. 
Because we either increase or decrease fc, each value in the 
sequence is different from the previous one. 

In the backward phase, the population is divided into two 
sub-populations with B or b alleles at the selected locus. At 
the end of the sweep, all ancestral lines are in the B popu- 
lation; this is the starting point for the backward phase. We 
trace the genealogies of the neutral loci backward in time by 
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traversing the sequence of k values (obtained in the forward 
pass) in reverse; this guarantees that the time-reversal of the 
Moran process is correct. Each time k changes, we generate a 
b ^ B event if the new value of k is smaller than the old one. 
Correspondingly, we generate a B b event if k increases. 
Between each change in k, we generate the B — > B and b ^ b 
events of the Moran chain (these events does not change k). 
The number m of such events has a geometric distribution, 
qk (1 - qkY", where 



gfe = (2 - s) 



2N 



1 



k 

2N 



The probability that the event is a b ^ b replacement is 



Wb^h 



(27V - fc)2 



wb^b + Wb^b {2Ny - (2 - s) k {2N - k) 



(8) 



(9) 



and, correspondingly, the B — > B replacements occur with 
probability wb^b/(w'b^b + w\,^b). Finally, the time be- 
tween each event is exponentially distributed with expected 
value (2iV)^^ in units of generations. 

We now describe the effect of the events generated during 
the sweep on the gene genealogies of the neutral loci. In each 
event, we choose the line to die and the line to replace it ran- 
domly from the appropriate sub-populations. As we proceed 
backward in time, the dying line coalesces with its parent line 
(e.g, in a B ^ b event, we pick the line to coalesce from the 
b sub-population). With probability r, recombination occurs 
between the selected locus and the right-most locus during 
the coalescent. In this case, the region between the selected 
locus and the recombination point coalesces with the chosen 
parent, and the second part of the neutral region, between the 
recombination point and the rightmost locus, coalesces with a 
parent chosen with uniform probability from the whole pop- 
ulation. We assume that the neutral locus of interest is suf- 
ficiently small so that there is at most one crossover event in 
the region in each meiosis (the deterministic coalescent mod- 
els, however, are not subject to this limitation since in these 
models the recombination rate can be arbitrarily high). For 
the values of r considered in this article this approximation 
is good. If necessary, it is straightforward to improve it, for 
instance by simulating an explicit recombination process in- 
stead of simply assuming that no or one crossovers occur in 
the interval in each meiosis. One may also implement more 
realistic models of recombinatio n, e.g. models which capture 
crossover interference (see, e.g., IMcPeek and SpeeH 1 19951 
for a review); for the purpose of this paper, however, the sim- 
plest model is sufficient. 

When the simulation has reached the beginning of the 
sweep, there is exactly one line carrying the B allele, and the 
genetic material of this individual is ancestral to all genetic 
material trapped in the sweep. In addition, there may be a 
set of lines which have escaped the sweep because of recom- 
bination as explained in section We then follow the lines 
carrying genetical material from the sample back in time un- 
til the most recent common ancestor of each locus has been 
found for the sample. Since there is no selection in this part of 
the history, the Moran process is a coalescent where the rate 



(in units of events per generation) of two lines coalescing is 
n(n — 1)/2N, where n is the number of lines in the popula- 
tion, and the rate of recombination is r. 



IV. AVERAGING OVER REALISATIONS OF THE SWEEP 



IDurrett and SchweinsbergI (|2004 have convincingly 
shown that it is necessary to consider the fluctuations of the 
favoured-allele frequency (displayed in Fig. ^ in order to ac- 
curately represent effects of the sweep on nearby loci. 

We now explain how to efficiently and accurately average 
over such fluctuations. We motivate our method by an exam- 
ple: how to compute the probability that the first recombina- 
tion event, if it occurs during the sweep, occurs with an indi- 
vidual not carrying the favoured allele at the selected locus. In 
section [V] we describe a coalescent process which makes use 
of the ideas described in this section. 



A. An example 

We illustrate our approach by considering the conditional 
probability Q{r) that the first recombination event, if it occurs 
during the sweep, occurs with an individual not carrying the 
favoured allele at the selected locus: 



Qir) 



dtr e 



(10) 



Q{r) depends on the realisation of x{t) of the sweep of du- 
ration r. For small values of r, it is unlikely that a given 
line experiences more than one recombination event during 
the sweep, and in this case Q{r) is approximately the proba- 
bility that the line escapes the sweep. 

Fig. [3] shows the average {Q{r)) over realisations of x{t) 
as a function of r, obtained from Moran-model simulations 
(circles). Also shown are the results from the logistic model 
(dashed line), derived as follows. Inserting the solution of ([TJ 



xit) 



1 



1 + Q-s{t-T/2) 



(11) 



(where t = 2 ln(2A'' — l)/s is the duration of the sweep in the 
logistic model), into ( fTOb and expanding the integrand in ( fTOl i. 
we obtain 



(g(r)) = l-e— /2 + ^(-l)"2r 



-rr/2 _p — nsr/2 



(12) 



As can be seen in Fig. [S] the result (fT2] | deviates significantly 
from the Moran-model results. 

We now show how to obtain a much more accurate approx- 
imation (solid line in Fig.[3]l. 

The problem in averaging ( fTOl i over different realisations of 
the stochastic Moran sweep lies in that both the upper bound 
T of the integral and the integrand fluctuate. In the follow- 
ing we describe an approximate method of averaging ( fTOl ) 



FIG. 3 Comparison of {Q{r)) as a function of r for the different 
models: Moran simulations (circles), the deterministic logistic model 
(dashed red line), and the new deterministic model (solid blue line). 
The population size is A'^ = 10'' and the selection parameter is s = 
0.01. 
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FIG. 4 Comparison between the actual growth curve ki versus ti, 
red line, and the corresponding sorted curve ki — i versus ti, black 
line. The parameters are A'^ = 10'^ and s — 0.01. 



which gives accurate results and motivates a new determin- 
istic model for selective sweeps. To begin with, note that x{t) 
is piecewise constant function of time in the Moran model. A 
realisation of the growth of the B allele is determined by a se- 
quence of M pairs {ki,Ti) where ki is the number of copies of 
B in time interval i, and Ti is the duration of this interval (the 
latter begins at ti = '^i)- Th^ sweep begins with ki — 1 

at time ti — 0, and ends with km = '2N at time tm- Thus, we 
have 



Qir) 



Af-l 

E 

i=l 



-rti 



2N -h 
2N 



(13) 



The number M of steps in the growth process fluctuates and 
is usually much greater than 2N — 1 since ki is usually not an 
increasing function of i. 

We construct an increasing growth curve from the sequence 
{ki,Ti) as follows. First, consider the sequence obtained by 
sorting the intervals such that ki < fc^+i. Second, merg- 
ing all intervals with the same value of ki into one contigu- 
ous segment, we obtain a sequence of 2N — 1 segments, 
(k, ^ i,Ti ^ Y.j:kj=i^])' with = I]i=i^j so that t2Ar 
is the duration of the sweep. Note that ti may also be written 
^s X)j -fc <i ^7' which implies t2N = t2N- This 'sorted' sweep 
is monotonous: there are i copies of allele B in the population 
during the time interval [ti,ti^i], and at time ti+i the num- 
ber of copies of B increases by one. Fig. |4] shows that this 
results in a surprisingly accurate representation of the original 
trajectory x{t). This is so because of the conditioning on the 
success of the sweep: large downwards fluctuations of ki are 
rare. 

In terms of the 'sorted' sweep, eq. ( fT3] l can be written as 



Q{r) 



2N-1 

E 

k=l 



2N -k 
2N 



(14) 



by exp{—r{tk)), we find 

27V-1 



(Q(r)) « E 



fe=l 



2N -k 
2N 



(15) 



The expectation values (tk) can be calculated analytically as 
shown in section HV.CI below. In Fig. [51 {Q{r)) according to 
(fTSl l is shown as a blue line, in very good agreement with the 
numerical data (circles). 



B. A deterministic model for x{t) 

Our result dTsl l can be written in the form (fTol l by intro- 
ducing a deterministic model for the sweep. Let k{t) be the 
solution of {tk) = t for k. In Fig. ID k{t) is shown as a solid 
black line. Let x{t) k{t)/{2N). Then 



dire 



(16) 



where r — (t27v) is the expected duration of the sweep. 

In practice, k{t) is obtained as follows: we pick 10"^ lin- 
early spaced values for t in the interval [0, (t27v)]- For each 
value of t, we find the k such that (tk) < t < (t^+i), using 
eq. ( |26] | to calculate the values of (tk). To find the value of 
k corresponding to t, we use linear interpolation between the 
endpoints of this interval. 

Results of coalescent processes based on the model x{t) 
for the selective sweep are summarised in section [Vl] As 
expected the results obtained exhibit equally good agreement 
with our Moran-model simulation as does Fig. [3] 



C. The expected value of tk 



Averaging ( fT4] i over the realisations of the sweep is straight- 
forward. Assuming that (exp(— rife)) can be approximated 



In this section, we derive an analytical expression for (t^.), 
the total time during the whole sweep when there are k copies 
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of B or less, starting from a single copy. More generally, let 

(k) 

Tj be the corresponding time, measured during the remain- 
ing parts of the sweep starting from k copies of B. Thus, we 

have (tk) = (rf^-')). 

The value of (T^'^'^'') equals the expected time until the next 
event, plus the expected time spent in states with k copies of 
B or less from the next state. Thus, we have the recursion 

(7;W) = {n)e,^,+p+{^){rll\) +p_«(r(^)), (17) 

where 9i is one if i > and is zero else, and p±{i) is the 
probability of going from i to i ± 1 copies of B, c.f. Eq. (|7]l- 
In order to find a unique solution to ( [TtI i. we need to provide 
boundary conditions. First, we note that the transition from 
i 1 to i = is forbidden (this is known as a 'natural bound- 
ary condition'). Second, if the sweep is started at i = 2N it 
stops immediately; thus, we must take 



(18) 



for all k. In the following it turns out to be convenient to 
introduce 



Writing ( fTTI l in terms of ^l*^^ leads to a recursion with constant 
coefficients: 

- (1 + ) + CO 41\ = -(1 + Lo)il- Ok-.,. 

(20) 

We solve (|20] | as follows. First, from ( |20] | we obtain a recur- 
sion for the difference a|'^'' ~ — 0^'^' : 



(19) 



A 



(fe) 



L0/\f\~{l+u){l-L0''){T,)eu_ 



(21) 



By telescoping from zero to i, we find the solution 



At i = 0, ( fT9] l implies 0^ ^ = 0, which leads to Aq 
With this, summing ([22] | from to i — 1 leads to 



(22) 



An) 



^ 1 ^ ^ (1-^^)(1-^) 



(fc)^ :^{1-LU'-^){1-LU^) 



(23) 



Setting i — 2N in (l23l l. and using {T2jJ ) = 0, we can solve 

for(ri''^'^): 



)(1-^) 



(24) 



Between each change in fc, there is a geometrically distributed 
number of events. It follows from ([8]l that the expected time 
between two changes in k is 



[Tk) = [Wb^B+WB^bJ 

= 2N/[k{2N ~ k){l+uj)] 



(25) 



generations. Inserting the value of (t^) and writing the solu- 
tion in terms of we obtain 

■fe ^ (2iV - (1 - c^) (1 - c^2W) • ^^'^^ 

Finally, we note that higher moments of tk, especially the vari- 
ance, can be obtained in a similar manner 



V. THE BACKGROUND COALESCENT FOR NEUTRAL 
LOCI IN THE VICINITY OF A SELECTED ONE 

As explained in section [III selection influences, via the 
hitch-hiking effect, the evolution of neutral loci on the same 
chromosome as the selected locus. Given a particular growth 
of the favourable allele frequency x{t) as a function of time, 
what is the evolution of the linked neutral loci? 

The st andard approach is to foUow IKaplan et al\ (Il989l) 
(see also IKaplan et al\. Il988l) in modeling the effect of se- 
lection on the neutral loci as a form of population structure: 
The selective sweep is viewed as a two-island population with 
migration, where one island, with population size 2Nx, con- 
tains the individuals with the B allele; the other island has 
population size 2A^(1 — x) and contains the individuals with 
the b allele. Coalescent events can occur only between in- 
dividuals on the same island. Recombination, however, may 
move a line from one island to the other, since the parent of 
the second product of the recombination event is chosen uni- 
formly from the whole population. 

It is useful to write the total rate of coalescent and recom- 
bination events in the subdivided population in the form 



Atot = AbPb + AbPb, 



(27) 



where Ab and Ab are the total number of birth-death events 
per generation in the B and b sub-populations, respectively, is 
given by 



Ab = 2Nx, 

Ab ^2N{l-x), 



(28) 



and where and pb are the probabilities that a single birth- 
death event leads to a coalescent or recombination event 
(or both) involving an individual in the corresponding sub- 
population. 

Consider the probability p-g,. First, a birth-death event has 
no effect on the gene genealogies unless the individual born 
is an ancestor to a locus of an individual in the sample. The 
probability that this is the case is simply nB/{2Nx), where 
riB is the number of ancestral lines currently in the B sub- 
population. Second, in order for the gene genealogies to 
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change either recombination must happen during the birth - 
this happens with probabiHty r - or the parent must belong 
to a different ancestral line of the sample; the probability 
that this happens is (tib — l)/{2Nx). Since one of the sub- 
populations can be quite small, especially close to the ends of 
the sw eep, we cannot make the usual assumption (IHudsonI 
Il990h that recombination and coalescence cannot occur in the 
same event. Putting it all together, we find 



Pb 



riB 

2Nx 



(1-r) 



2Nx 



(29) 



The first term corresponds to two lines coalescing in the B 
population with no recombination, and the second term corre- 
sponds to all events involving recombination. 

We derive the probability pb of an event in the b sub- 
population in the same way as for p^- The result is 



Ph 



2N{l-x) 



(1 



nb - 1 
2N{l-x) 



(30) 



where, correspondingly, is the number of ancestral lines 
currently in the b sub-population. 

When X and the other parameters are constant, the coa- 
lescent is a Poisson process, and the time to the next event 
is exponentially distributed with expected value 1/Atot, see 
Eq. (IZTT i. In a selective sweep, however, x changes with time; 
hence, the coalescent is an inhomogeneous Poisson process. 
Given the state of the population at time ti, the distribution 
/(^2 |ii) of the time t2 of the next event is 



fit2\ti) = Xtot{x{t2)) exp 



Xtot{x{t)) dt 



(31) 



Hence, given that we have simulated the sweep from the end 
of the sweep to time ti, the time t2 of the next event is deter- 
mined by solving the equation 



Xtot{x{t)) dt^r] 



(32) 



numerically for t2, where rj is an exponentially distributed 
variable with expected value unity. For some simple growth 
models it is possible to find explicit analytical expressions for 
t2 as a function of ti and 77; mostly, however, one must use nu- 
merical approximations of the integral. In this paper, we con- 
sider x{t) in (|32] | to be a given, piecewise constant function. 
Also when we have explicit expressions for x{t) it is conve- 
nient, and efficient, to take a number of samples at equally 
spaced points in time. We are then able to quickly find the 
interval containing the value of t2 that solves ( |32] | (if x{t) is 
piecewise constant, the left-hand side of ( [32l ) is piecewise lin- 
ear and continuous). 

This concludes our review of the standard background coa- 
lescent. There is only one problem with this picture: the rates 
Ab and Ab do not accurately describe the rate of birth-death 
events in the two sub-populations when we compare to simu- 
lations using the Moran-model algorithm described in section 
Hm we observe slight but statistically significant deviations 



X 10 



I 



m 




X 10 



FIG. 5 Shows the birth rate of B alleles, Ab, as a function of x for 
= 10*, s = 0.01, and 10* Moran simulations (white circles). 
Also shown is the theory developed below (solid blue line). Note 
that the standard rates J28b correspond to Ab ~ 2Nx. 



for large values of s (we find that the effect is negligible for 
s < 0.03, and is most significant when both s and r are rela- 
tively large). 

As is shown in Fig. |5] the true birth rate of B alleles as a 
function of x in the Moran model is given by the total rate of 
all events leading to the birth of a B allele: combining eqs. ^ 
and we have 



Ab = wb^b + Wb^B 

sx{l — x) 



2N 



2Nx 



(33) 



Hence, the birth rate of B alleles is larger than expected from 
the standard model. Since the total number of events is fixed 
at {2NY per unit of time, the birth-rate of the b alleles is 
correspondingly smaller: 



Ab - 2A^ - Ab 



(34) 



In general, we see that deviations from the standard rates are 
due to the difference in the birth rates of the two alleles. It 
is the selection process which causes extra births to happen in 
the B sub-population, and fewer births in the b sub-population. 

In Fig.|6]we illustrate the difference between choosing the 
birth-rates according to the standard method ( l28T l. and accord- 
ing to ( l33T l, by measuring the probability p2inb that two an- 
cestral lines of a neutral locus escape the sweep separately. 
The parameters are = 10* and s — 0.01, corresponding 
to moderately strong selection. The background coalescent 
using Ab from ( |33] | is in good agreement with the Moran sim- 
ulations, while the results using the rates ( |28l ) exhibit a small 
but significant difference. Other quantities exhibit similar dif- 
ferences (not shown). 
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FIG. 6 Probability p2inb that two ancestral lines of a neutral locus 
escape the sweep separately, as a function of the amount of recom- 
bination r between the neutral and the selected locus. Shown are re- 
sults of Moran-model simulations (circles), results of the background 
coalescent with the growth x{t) given by sampling the Moran pro- 
cess for the selected locus, using either the standard rates in the liter- 
ature ( 128b . red dashed line, or the newrates l l33b and ( 134b . blue solid 
line. T he coalescent simulations of lDURRETT and Schwe insbergI 
(triangles) are consistent with the former, while our Moran 
model is much closer to the latter. The parameters are: A'^ — 10* and 
s = 0.1. 



VI. RESULTS AND DISCUSSION 

We have implemented the background coalescent for a con- 
tiguous segment of neutral loci close to a selected site (section 
W\ using the deterministic model = k{t)/{2N) described 
in section |TV] k{t) is obtained by solving (tk) = t for k, as 
described in section HV.B I 

To establish the accuracy of our algorithm, we compare its 
results to those of Moran-model simulations. In particular we 
compute th e distribution over partitions at a neutr al locus in 
the sample dOuRRETT and SchweinsbergI|2004 explained 
below in section B). 




A. Duration of the sweep 



FIG. 7 Comparison of the exact expression i35\ , symbols, for the 
expected duration of the sweep (in units of 2N generations) as a 
function of s, to the approximation (c.f. eq. [37] solid blue lines) 
and the logistic model (solid red lines), for = 10"^ (squares) and 
= 10* (circles). As a reference, the result (36) is also shown 
(dotted line). 



Here 7 is Euler's constant, 7 w 0.577216. This approximation 
is excellent: as is shown in Fig. |7] the approximation breaks 
down only when 2Ns < 2. Except for the 7-term, (|37] l is 
also the expected duration of the sweep one obtains in the dif- 
fusion approximation for the sweep conditioned on success 
(IEtheridge et a/.ll2006l Lemma 3.1). 

This result should be contrasted with the deterministic lo- 
gistic sweep, where the duration of the sweep is 2 log(2A^ — 
l)/s. For large values of s, the duration is close to that of both 
the Moran model and to the approximation eq. dSTT i. Thus, 
quantities depending primarily on the duration of the sweep, 
such as the amount of recombination taking place during the 
sweep, will be accurately described in the logistic model when 
the selection is strong. From ( l37b . and in Fig. |2l we see that 
this happens when |log(s)| is small compared to log(2Af). 
When s is small, however, the duration of the sweep in the 
logistic model is very different from that of the Moran model, 
and consequently we expect a clear difference in the effect of 
the sweep on the neutral loci nearby. 



According to the results in section HV.CI we can use ( |26] l to 
obtain a closed expression for {t2N), the expected duration of 
the sweep. Because of symmetry, we can write {t2N) in the 
form 

^^^ 2(l-c.^^-^) (1-..^) 

^^^^^ - fc(l-C.)(l-C.2^) ■ ^^^^ 

In the limit s — > 0, we obtain the familiar result (see, e.g., 
lEWENsl [19791 for a review) 

When 2Ns is large, we approximate cj^^ « 0, and obtain to 
leading order 

(f,„).2l2SeM±2. ,37) 

S 



B. Partitions 

In this subsection, we consider the distribution of par- 
titions at a neutral locus at distance r from the selected 
locus in a sample of two individuals in the pop u lation . 
The partitions are defined as follows (IDonnellyI Il986t 
IDurrett and SchweinsbergI |2004 . Suppose we follow 
the ancestral lines of the neutral locus in the two individuals 
back in time through the sweep. Because of recombination, 
the lines may move from the B population to the b popula- 
tion, and (with a rather small probability) back again. They 
may coalesce in one of the populations, or stay separate dur- 
ing the whole sweep. For two lines, we have four distinct 
cases: both lines coalesce during the sweep and the resulting 
line is trapped by the sweep (the probability for this to hap- 
pen is denoted by p2cinE); one line escapes the sweep and the 
other is trapped (plBlb); both lines escape the sweep but do 
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FIG. 8 The distribution over the partitions as a function of the ge- 
netic distance r from the selected locus. We show one panel for each 
of the four partitions. The population size is TV = 10^, and the se- 
lection parameter s = 0.1. The data shown are: Moran simulations 
(circles), logistic model (dashed black line), our own model (solid 
blue line), the DS-algorithm (dash-dotted red l ine), and coalescent 
simulations of lDURRETT and SchweinsbergH2004.) (triangles). 



not coalesce {p2inb)\ the lines coalesce and then escape, or es- 
cape separately and then coalesce (much less likely), denoted 
by p2cinb. 

Far away from the sweep, one expects all lines to escape 
independently. For large population sizes it is unlikely that 
lines coalesce during the sweep, but it becomes more common 
when the population size is relatively low (e.g., for N ^ 10'^). 
Close to the selected locus, nearly all lines are trapped in the 
sweep. The frequency of the case where one line is trapped 
and the other line escapes has a maximum for intermediate 
genetic distances r. 

In Fig. [8] we compare the four models: the Moran model, 
the logistic-sweep model, the DS-algorithm, and our own 
algorithm, when N = 10* and s = 0.1, correspond- 
ing to strong selection. Also shown are the coal escent 
simulations of IDURRETT and SchweinsbergI (|2004 . The 
plot covers the approximate range of validity quoted by 
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FIG. 9 As Fig.m but for s = 0.03. 

IDurrett and SchweinsbergI (|2004 for their algorithm: 
r < s/\n2N which evaluates to « 0.01. Over this range, 
all curves except the logistic model agree. In particular, the 
logistic model gives a higher value for p2cinb than expected; 
the most likely reason for this deviation is that the duration of 
the sweep is slightly too long in the logistic model (c.f. Fig.|7]). 

Figs. |9] and [TOl show the same quantities as Fig. [8] but for 
s = 0.03 and s — 0.001, respectively. The range of validity 
of the DS-algorithm is r < s/ ln2N which is 0.003 in Fig.|ll 
and 10^** in Fig.[TO] Within this range, all curves except the 
logistic model agree approximately. 

For larger values of r, the most important contribution 
to the difference between the Moran model and the DS- 
algorithm is that the latter ignores recombination events and 
coalescent events during the middle and late stages of the 
sweep. As can be seen in the figures, this is a very good ap- 
proximation provided r is sufficiently small, or provided the 
sweep is sufficiently short. The accuracy of the logistic model 
quickly deterioates as s decreases. Again, the most important 
reason is that the sweep is too long compared to the Moran 
model. 

Our algorithm, by contrast works well also for large values 



11 




0.2 0.4 0.6 0.8 1 




FIG. 10 As Fig.m but for s = 0.001. 



Our algorithm is as easily implemented as the stan- 
dard logistic model, but is far more accurate, even 
apphcable beyond the range of p arameters given by 
IDurrett and S chweinsberg' ('2004') for their algorithm. 
For practical purposes it is important that the sweep is not as- 
sumed to happen instantaneously, so mutations occuring dur- 
ing the sweep are not neglected. Furthermore, the algorithm 
determines the fate of a contiguous segment of neutral loci in 
the vicinity of the selected locus. Figs. 8-10, for example, 
were obtained by one single run of our algorithm. 

Our results have implications beyond the immediate con- 
text of this article. First, we introduced a new approximate 
representation of selective sweeps (the 'sorted' sweep) which 
locally averages over fluctuations in the favoured-allele fre- 
quency. We suspect that this approximation retains the fluctu- 
ations relevant for an accurate description of the genealogies 
of neutral loci close to the selected site. In which range of pa- 
rameters this is true will be the subject of a subsequent study. 
Second, in the coalescent for the neutral loci, we have shown 
that the standard expression for the rates ( |28] | must be mod- 
ified. We expect that similar modifications are necessary in 
other cases, e.g. Moran models with changing population- 
sizes, as for instance in population expansions and bottle- 
necks. 

We conclude with describing a possible application for our 
model. It will be of use in efficiently and accurately de- 
termining log-likelihood surfaces for the parameters s and 
N in the Moran model of directional selection (see, e.g., 
ICOOP and Gri ffithsI |2004|) where an accurate and compu- 
tationally efficient model is required. We believe that our de- 
terministic approximation will be of use in this context. 
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of r and small values of s, although it is clear that the devia- 
tions from the Moran model become larger for smaller values 
of s. This is to be expected since the fluctuations of the sweep 
increase with decreasing s. 

Last but not least we emphasize that the curves in Figs. [8]- 
[TO]are obtained by a single run of our program for a contigu- 
ous stretch of DNA adjacent to the selected site. The DS- 
algorithm requires a separate simulation for each value of r. 



VII. CONCLUSIONS 

We have implemented a new model for genetic hitch-hiking 
based on a deterministic approximation for the growth of 
the favoured-allele frequency during the selective sweep, in 
combination with a coalescent process for a locus (or set of 
loci) close to the selected locus. By comparison with direct 
Moran-model simulations we could show that our new model 
is very accurate. Two reasons for this success are: our model 
faithfully approximates the expected duration of the selective 
sweep, and it is conditioned on the success of the sweep. 
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